Faster K-Means Cluster Estimation
Authors
Abstract
K-means is a widely used iterative clustering algorithm. There has been considerable work on improving k-means in terms of both mean squared error (MSE) and speed. However, most k-means variants compute the distance from every data point to every cluster centroid in every iteration. We propose two heuristics to overcome this bottleneck and speed up k-means. The first heuristic predicts the candidate clusters for each data point by looking at nearby clusters after the first iteration of k-means. The second heuristic aggressively reduces this candidate cluster list further. We augment well-known variants of k-means with our heuristics to demonstrate their effectiveness. On various synthetic and real-world datasets, our heuristics achieve a speed-up of up to 10 times without a significant increase in MSE.
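The first heuristic can be illustrated with a minimal Python sketch. The abstract does not specify how candidate clusters are chosen, so this sketch assumes that each point's candidate list is simply its `n_candidates` nearest centroids after the first full iteration; the function names, parameters, and the empty-cluster handling are hypothetical and are not taken from the paper. The second heuristic (further pruning of the list) is not shown.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its assigned points."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # Assumption: an empty cluster keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


def candidate_kmeans(points, k, n_candidates, n_iters=20, init=None, seed=0):
    """K-means that restricts later iterations to per-point candidate clusters."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]

    # First iteration: full distance computation (every point to every centroid).
    # Assumed candidate rule: keep the n_candidates nearest centroids per point.
    candidates, labels = [], []
    for p in points:
        order = sorted(range(k), key=lambda j: sq_dist(p, centroids[j]))
        candidates.append(order[:n_candidates])
        labels.append(order[0])
    centroids = update_centroids(points, labels, centroids)

    # Later iterations: compare each point only against its candidate clusters.
    for _ in range(n_iters - 1):
        new_labels = [
            min(cand, key=lambda j: sq_dist(p, centroids[j]))
            for p, cand in zip(points, candidates)
        ]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(points, labels, centroids)
    return centroids, labels, candidates
```

Under this assumption, each iteration after the first costs roughly `n_candidates` distance computations per point instead of `k`. A point can never move to a cluster outside its candidate list, which is one plausible source of the small MSE increase the abstract mentions.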
Similar resources
Inter Cluster Distance Management Model with Optimal Centroid Estimation for K-Means Clustering Algorithm
Clustering techniques are used to group transactions based on relevancy. Cluster analysis is one of the primary data analysis methods. Clustering can be done in two ways: hierarchical clustering and partition clustering. The hierarchical clustering technique uses the structure and data values. The partition clustering technique uses data similarity factors. Transact...
X-means: Extending K-means with Efficient Estimation of the Number of Clusters
Despite its popularity for general clustering, K-means suffers three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we int...
Identifying People's Blood Donation Behavior Patterns Using the K-Means Algorithm Based on Recency, Frequency, and Blood Value
Introduction: The blood donation rate in developed countries is 18 times higher than in developing countries. It is estimated that if only five percent of Iran's population donated blood, it would be adequate to meet the needs of the community. The aim of this paper is to identify blood donors' loyalty behavior for proper planning to extend and enhance blood donation habits among t...
Scalable and Distributed Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...
An Efficient Document Clustering Based on HUBNESS Proportional K-Means Algorithm
Evaluating similarity between documents is a main operation in the text processing field. Similarity measurement is used to estimate the relationship between records or documents. In the existing system, similarity between two documents can be computed with respect to features by using the Similarity Measure for Text Processing (SMTP). The proposed hybrid SMTP scheme is integrated with hubness base...